[57] ABSTRACT

A structured document generating method and apparatus
capable of easily generating a structured document matching
the document structure of each non-structured document, by
using a rule directly generated from a preset document
structure definition for the conversion of the non-structured
document into the structured document. A keyword extracting
module extracts keywords representative of the document
structure from a non-structured document by using a
keyword extracting rule, and a keyword/text model is generated
which is described by two kinds of elements, namely keywords
and other strings. A parsing module, generated automatically
from a parsing rule that is itself obtained by modifying and
converting a DTD, performs a parsing process on the
keyword/text model to generate an interim SGML document.
An SGML document correcting module modifies the
interim SGML document and generates a final output of an
SGML document by referring to DTD difference information
generated when the parsing rule was generated.

FIG. 15

<LAW> 

< CHANGE > 

< OPENINGTITLE > 

OAAPREFECTURE FLOOD DEFENCE SIGNAL REGULATION 

< / OPENINGTITLE > 
1501 < PROMULGATION >
1502 < PROMULGATIONDATE >

SHOWA 24, OCTOBER, 6
1503 < / PROMULGATIONDATE >

< ESTABLISHEDREGULATIONNO. >
AAPREFECTURE REGULATION NO. 78

< / ESTABLISHEDREGULATIONNO. >

1504 < PROMULGATIONSTATEMENT >

AAPREFECTURE FLOOD DEFENCE SIGNAL REGULATION IS
TO BE PROMULGATED AS IN THE FOLLOWING

1505 < / PROMULGATIONSTATEMENT >

1506 < / PROMULGATION >

< TITLE > 

AAPREFECTURE FLOOD DEFENCE SIGNAL REGULATION 

< / TITLE > 

< / CHANGE > 

< PRESENTREGULATION > 

< ARTICLE > 

< ARTICLENO. > 
ARTICLE 1 

</ ARTICLENO. > 

< FIRSTPARAGRAPH > 

< FIRSTPARAGRAPHSTATEMENT > 

FLOOD DEFENCE SIGNALS STIPULATED IN ARTICLE 13,
PARAGRAPH 1 OF THE FLOOD DEFENCE LAW
(SHOWA 24, JUNE, LAW NO. 193) INCLUDE THE FOLLOWING.

< / FIRSTPARAGRAPHSTATEMENT > 

< PARAGRAPH > 

< PARAGRAPHNO. > 
( 1 ) 

< / PARAGRAPHNO. > 

< PARAGRAPHSTATEMENT > 

FIRST SIGNAL : FOR NOTIFYING AN ALARM WATER LEVEL 

< / PARAGRAPHSTATEMENT > 

< / PARAGRAPH > 

FIG. 17

<LAW> 

< PROMULGATION > 

< PROMULGATIONSTATEMENT > 

AAPREFECTURE FLOOD DEFENCE SIGNAL REGULATION IS 
TO BE PROMULGATED AS IN THE FOLLOWING 

< / PROMULGATIONSTATEMENT > 

< PROMULGATIONDATE > 
SHOWA 24, OCTOBER, 6

< / PROMULGATIONDATE > 

< PROMULGATIONOFFICER > 

< OFFICIALTITLE > 
[ NONE ] 

< / OFFICIALTITLE > 

< NAME > 
[NONE ] 
</ NAME > 

</ PROMULGATIONOFFICER > 
</ PROMULGATION > 

< ESTABLISHEDREGULATIONNO. > 
AAPREFECTURE REGULATION NO. 78 

</ ESTABLISHEDREGULATIONNO. > 

< TITLE > 

AAPREFECTURE FLOOD DEFENCE SIGNAL REGULATION 
</ TITLE > 

< PRESENTREGULATION > 

< ARTICLE > 

< ARTICLENO. > 
ARTICLE 1 

< /ARTICLENO. > 

< FIRSTPARAGRAPH > 

< FIRSTPARAGRAPHSTATEMENT > 

FLOOD DEFENCE SIGNALS STIPULATED IN ARTICLE 13, 
PARAGRAPH 1 OF THE FLOOD DEFENCE LAW 
(SHOWA 24, JUNE, LAW NO. 193) INCLUDE THE FOLLOWING.

< / FIRSTPARAGRAPHSTATEMENT > 

< PARAGRAPH > 

< PARAGRAPHNO. > 
(1) 

</ PARAGRAPHNO. > 

< PARAGRAPHSTATEMENT > 

FIRST SIGNAL : FOR NOTIFYING AN ALARM WATER LEVEL 
< / PARAGRAPHSTATEMENT > 

METHOD AND APPARATUS FOR
GENERATING STRUCTURED DOCUMENT 

CROSS-REFERENCE TO RELATED 
APPLICATIONS 

This application relates to U.S. application Ser. No.
08/657,306, filed by Y. AOYAMA et al. on Jun. 3, 1996, now
U.S. Pat. No. 5,956,726, entitled "Method and Apparatus for
Structured Document Difference String Extraction" and
assigned to the present assignee. The disclosure of that
application is incorporated herein by reference.

BACKGROUND OF THE INVENTION 

1. Field of the Invention 

The present invention generally relates to management of 
documents having a regular document format such as legal 
documents, and particularly to a method and apparatus for 
generating a structured document from a non-structured 
document. The "non-structured document" means a docu- 
ment which does not contain information explicitly showing 
the structure of a document entered through character 
recognition, a word processor, or the like. The "structured 
document" is a document which contains information 
explicitly showing the structure of the document. 

2. Description of the Related Art 

In a known method of generating a structured document, 
information explicitly showing the document structure is 
embedded in a text. Generally, a document generated by a 
user (hereinafter called a "document instance") often con- 
tains a portion for designating a file which describes a 
document structure definition and a text content portion. The 
document structure definition defines the document structure 
and a mark indicating an element (the mark is hereinafter 
called a "tag"). The document structure definition is often set 
in order to efficiently use a document to be structured. The 
tag defined by the document structure definition is inserted 
into the text content portion in order to explicitly express the 
document structure and uniquely determine a string which is 
an element of the document structure indicated by the tag. 

In outputting a document instance structured in the above 
manner, an image to be output is generated by referring to 
a file which describes a layout definition defining what 
format is used for outputting each component (hereinafter 
called an "element") of the document structure. In this 
method, the document instance and the layout definition are 
independent so that any document instance can be used 
irrespective of the type of an apparatus or system to be used 
for the output.

The contents of a string of a structured document are 
explicitly expressed by inserting a tag such as <author 
name> and <title> which is in one-to-one correspondence 
with an element. Therefore, in combination with a tool such
as a full text search system for structured documents, an 
aggregation of document instances themselves can be used 
as a database, and the document contents can be added or 
changed easily. Even if part of this database is lost by some 
failure, it is possible to know that this database has a lost 
portion, by comparing the original document structure defi- 
nitions with the database of document instances. 

Because of these advantages, structured documents are 
widely used for document management of a document 
processing system which stores and uses a large number of 
documents. Along with this, several approaches have been
proposed to convert a non-structured document, such as an
existing paper document or a document entered with
a word processor, into a structured document.

JP-A-62-249270 and "Method of Converting Document
Image into ODA Structured Document" (Journal of Papers
of The Institute of Electronics, Information and Communication
Engineers, D-II, Vol. J76-D-II, No. 11, pp. 2274-2284)
propose the following method. First, the field of a document
type of a document is restricted. Next, a structured document
is generated by using a document structure common in the
restricted field (hereinafter called a "common document
structure") and a document structure analysis rule.

With this method, a document structure that can be used in
common within each field of documents, such as "technical
document" and "business document", is set. Then, the document
structure analysis rule is manually generated in order to
analyze a non-structured document and extract its document
structure. By using the document structure analysis
rule, the non-structured document is converted into a document
instance matching the common document structure. If
there is an element which is specific to the document structure
of an individual document (hereinafter called an "individual
document structure") and cannot be expressed by the common
document structure, the document instance matching the common
document structure is converted into a document instance
matching the individual document structure.

With this method, however, the document structure subjected
to the document structure analysis and the document
structure analysis rule are dependent upon the field of a
non-structured document. Therefore, in order to process a
document in a different field, a document structure analysis
rule for that field must be newly generated
manually. This work requires a large amount of labor.

This method uses a single document structure analysis
rule considered to be highly common to a plurality of types
of documents in a specific field. Therefore, this single
document structure analysis rule is not always optimum for
each document, and an element specific to an individual
document structure cannot be analyzed directly. In this case,
it becomes necessary after the document structure analysis
to convert the document instance again into another document
instance matching the individual document structure.
Specifically, tags of the first generated document instance
are added, changed, or deleted. This work generally requires
complicated operations and hence a large amount of labor.

Further, this method does not consider support for
generating a rule for extracting a keyword. Therefore, an
element serving as a keyword must be manually determined,
and the conditions of layout and string necessary for extracting
a keyword must also be manually set.

Still further, this method does not provide means for
supporting the determination of an element as a keyword
(hereinafter called a "keyword-corresponding element").
Elements which contain string data are not always extracted
as keywords. Elements having no characteristic layout or
string are not extracted as keywords, but are dealt with as a
string between keywords, i.e., a non-keyword.

The restriction condition that "non-keywords should not
be contiguous in a document instance" is imposed when
determining which element is to be a keyword-corresponding
element. This is because a non-keyword is a "string
between keywords" and is therefore required to be
always contiguous to a keyword. However, conventional
methods have no means for automatically checking whether
an aggregation of elements determined as keyword-corresponding
elements satisfies the restriction condition. If
the aggregation of these keyword-corresponding elements
does not satisfy the restriction condition, some defective or
erroneous conditions occur when the rule for document
structure analysis is generated or when the document structure
is analyzed. It is therefore necessary to determine the
keyword-corresponding elements again. This cycle must
be repeated until an aggregation of proper keyword-corresponding
elements is set.

Lastly, this method does not support setting the conditions
of layout and string necessary for the extraction of a keyword.
It is therefore necessary to manually collect information
necessary for the extraction of a keyword from the
non-structured document itself, or from rules or the like defining
the format of the non-structured document. This requires a
large amount of labor.

JP-A-6-290173 gives the following description. A document
structure indicating each element of a labeled document
is generated by referring to a "schema" describing
restricting information of the document structure, and then
a structured document is generated.

In JP-A-6-290173, however, although use of the schema
describing restricting information of the document structure
is described, how the schema is generated is not described.

SUMMARY OF THE INVENTION 

It is an object of the invention to solve the above problems
and enable proper document structure analysis of documents
of a plurality of fields.

It is another object of the invention to directly analyze
elements specific to the individual document structure and
to enable direct generation of a document instance matching
the individual document structure.

It is a further object of the invention to support the generation
of a rule for extracting a keyword.

In order to achieve the above objects, the invention
provides a method of generating a structured document for
a structured document generating apparatus having at least
an input/output device, a control unit, and a repository,
wherein a non-structured document not explicitly given the
document structure and input from the input/output device is
converted into a structured document explicitly given the
document structure, in accordance with a document structure
definition defining the document structure, the method
comprising the steps of: modifying a given first document
structure definition so as to match the document structure of
the input non-structured document and generate a second
document structure definition; the control unit generating a
parsing rule used for performing a parsing process suitable
for the document structure of the second document structure
definition, by modifying marks constituting the second
document structure definition and modifying the second
document structure definition so as to make the positional
order of the marks in one-to-one correspondence; in accordance
with the generated parsing rule, generating a first
structured document from the input non-structured document;
and in accordance with difference data between the
first document structure definition and the second document
structure definition, converting the generated first structured
document into a format matching the first document structure
definition to thereby generate a second structured document.

With the above configuration, conversion from the non-structured
document to the structured document can be
performed, for example, by a parsing module which analyzes
the document structure through parsing on the basis of
extracted keywords. The parsing module is generated by
converting a given document structure definition into a
parsing rule by means of a parsing rule generating module,
and by subjecting this parsing rule to a process of automatically
generating a parsing module.

In the process of automatically generating a parsing
module, an aggregation of rules such as "A is constituted by
patterns B, C, . . . " is input and a program for executing a
parsing process in accordance with these rules is output. A
particular process to be executed when each rule is satisfied
can be described in this program. Such a process of automatically
generating a parsing module may be yacc, for
example.
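
By way of illustration only, the idea of a rule set with an attached process per rule can be sketched in Python. The sketch interprets a small rule table directly instead of generating a C program as yacc does; the rule table `RULES`, the function `parse`, and the element names used here are assumptions made for this example and are not taken from the embodiment.

```python
# Hypothetical rule table: each nonterminal maps to one sequence of symbols
# plus an action that runs when the rule is satisfied (like a yacc action).
RULES = {
    "LAW": (["TITLE", "ARTICLE"], lambda parts: {"LAW": parts}),
    "ARTICLE": (["ARTICLENO", "TEXT"], lambda parts: {"ARTICLE": parts}),
}

def parse(symbol, tokens, pos=0):
    """Return (value, next_pos) if `symbol` matches tokens at pos, else None."""
    if symbol not in RULES:                      # terminal: compare token kind
        if pos < len(tokens) and tokens[pos][0] == symbol:
            return tokens[pos], pos + 1
        return None
    pattern, action = RULES[symbol]
    parts = []
    for sub in pattern:                          # "A is constituted by B, C, ..."
        result = parse(sub, tokens, pos)
        if result is None:
            return None
        value, pos = result
        parts.append(value)
    return action(parts), pos                    # run the per-rule process

tokens = [("TITLE", "X REGULATION"), ("ARTICLENO", "ARTICLE 1"), ("TEXT", "body")]
tree, consumed = parse("LAW", tokens)
```

A generator such as yacc handles alternatives, recursion, and lookahead that this sketch omits; the sketch only shows how a rule and its action relate.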

With the above configuration, if the same string in the 
same string region is extracted as a plurality of different 
keywords, the parsing module of the control unit selects a 
proper one from the plurality of keywords in accordance 
with whether the parsing process succeeds or fails. 
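
The selection among several keyword candidates can be pictured with the following Python sketch. It assumes, for illustration only, that every combination of candidates is tried in turn and that a toy function `parses` stands in for the real parsing process; the names `EXPECTED`, `parses`, and `disambiguate` are hypothetical.

```python
from itertools import product

# Hypothetical expected order of keyword kinds (a stand-in for the real parser).
EXPECTED = ["OPENINGTITLE", "PROMULGATIONDATE", "TITLE"]

def parses(kinds):
    """Toy parsing process: succeeds only for the expected sequence."""
    return list(kinds) == EXPECTED

def disambiguate(candidates):
    """candidates: list of (string, [possible keyword kinds]).
    Returns the first assignment of kinds for which parsing succeeds, or None."""
    for choice in product(*(kinds for _, kinds in candidates)):
        if parses(choice):
            return [(text, kind) for (text, _), kind in zip(candidates, choice)]
    return None

candidates = [
    ("X REGULATION", ["OPENINGTITLE", "TITLE"]),   # same string, two readings
    ("SHOWA 24", ["PROMULGATIONDATE"]),
    ("X REGULATION", ["OPENINGTITLE", "TITLE"]),
]
resolved = disambiguate(candidates)
```

Exhaustive trial is used here only for brevity; a parser that backtracks on failure reaches the same selection without enumerating every combination.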

In practice, a structured document is generated
as follows. First, a keyword
extraction module extracts keywords from the non-structured
document and generates an abstract keyword/text model,
which represents the non-structured document as
an aggregation of elements constituted by keywords and
other strings.

The parsing module performs a parsing process on
the keyword/text model to generate the structured document.
The parsing module itself is generated by
the following procedure. First, a given document structure
definition is modified so as to match the document structure
of the non-structured document, and the difference between
the two is stored. Next, the parsing rule generating module
converts the modified document structure definition into a
parsing rule. In doing so, a program is embedded in the
parsing rule which, when each rule is satisfied, i.e.,
when each element is detected, records
information on the detected element at the corresponding
position of the keyword/text model. Then, the process of
automatically generating a parsing module generates the
parsing module which realizes
the parsing process described in the parsing rule.

The parsing module generated in the above manner performs
a parsing process on the keyword/text model
generated by the keyword extracting module, and generates
an interim structured document matching the modified document
structure definition, in accordance with the parsing
results recorded in the keyword/text model. A structured
document correcting module refers to the difference stored
when the document structure definition was modified, and
outputs a structured document matching the document structure
definition before modification.

A given layout definition and a second document structure 
definition support the generation of a keyword extraction 
rule used for extracting a keyword. The second document 
structure definition is generated by modifying a preset 
document structure definition so as to match the document 
structure of the input non-structured document. 

Specifically, the keyword extracting module comprises: 
means for extracting layout information from the given 
layout definition, the layout information including informa- 
tion about layout and string used when each element of the 
document structure is output; means for extracting informa- 
tion of connection between elements from the second docu- 
ment structure definition; means for supporting a determi- 
nation by a user of which element is extracted as the 
keyword, by using the information of connection between 
elements; and means for a user to edit layout information 
extracted from the layout definition so as to match the layout 
of the non-structured document. 

The means for editing layout information comprises:
means for notifying the user of the layout information extracted
for each element of the document structure, the
layout information being provided for each item necessary
for extracting a keyword; and means for the user to modify
the notified layout information so as to match the layout of
the non-structured document or to supplement missing information.

With the above structure, the document structure and the
rule for analyzing the document structure are generated by
modifying the document structure definition preset for each
document. Therefore, the labor required for designing the
document structure for document structure analysis and
for generating the rule can be reduced. Since the
parsing rule dynamically generated in accordance with the
document structure definition of each document is used, it is
possible to directly generate the structured document matching
the individual document structure without using the
common document structure, and it is not necessary to
convert the structured document from the format matching
the common document structure into the format matching
the individual document structure.

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a block diagram illustrating the operation outline
of a structured document generating system according to an
embodiment of the invention.

FIG. 2 is a diagram showing an example of a non- 
structured document. 

FIG. 3 is a diagram showing part of DTD which is a
document type definition of an SGML format set for the
document shown in FIG. 2.

FIG. 4 is a tree diagram showing part of DTD shown in 
FIG. 3. 

FIG. 5 shows an example of a keyword extraction
rule in part.

FIG. 6 is a diagram explaining a description constituent of 
the format condition of the keyword extraction rule shown 
in FIG. 5. 

FIG. 7 shows an example of extracted keywords.

FIG. 8 shows an example of a keyword/text model.

FIG. 9 is a block diagram illustrating the operation outline
of a parsing rule generating module.

FIG. 10 shows an example of a modified DTD in part.

FIG. 11 shows an example of DTD difference data.

FIG. 12 shows conversion rules to be referred to when the
parsing rule generating module converts DTD into a yacc
rule.

FIG. 13 shows an example of an interim yacc rule in part.

FIG. 14 shows an example of a parsing rule in part.

FIG. 15 shows an example of an interim SGML document
in part.

FIG. 16 illustrates an example of a process by an SGML 
document correcting module. 

FIG. 17 shows an example of an SGML document finally
generated by the method of the embodiment.

FIG. 18 is a block diagram showing the hardware struc- 
ture of the structured document generation system of the first 
embodiment. 

FIG. 19 is a diagram illustrating the process outline to be 
executed by the parsing module. 

FIG. 20 shows an example of a keyword/text model with
tag information being given.

FIG. 21 is a block diagram illustrating the process outline
to be executed by a keyword extraction rule generating
system according to a second embodiment of the invention.

FIG. 22 shows an example of extraction of string-corresponding
elements.

FIG. 23 shows an example of the modified DTD shown in 
FIG. 10 described in BNF notation. 

FIG. 24 is a diagram illustrating the procedure of obtain- 
ing string-corresponding elements capable of appearing at 
the start of each element. 

FIG. 25 shows string-corresponding elements capable of 
appearing at the start and end of each element in the 
modified DTD described in BNF notation shown in FIG. 23. 

FIG. 26 is a diagram showing the contiguity relationship 
between string-corresponding elements in the modified 
DTD described in BNF notation shown in FIG. 23. 

FIG. 27 shows an example of string-corresponding ele- 
ment information. 

FIG. 28 shows an example of layout information. 

FIG. 29 shows an example of required items necessary for 
extracting a keyword. 

FIG. 30 shows an example of the process of extracting a 
required item from the layout definition. 

FIG. 31 is a diagram showing an example of an interface 
of a keyword information indicating module. 

FIG. 32 is a flow chart illustrating the processes to be 
executed by the keyword information indicating module. 

FIG. 33 is a diagram showing an interface of a supple- 
mentary information editing module. 

FIG. 34 is a flow chart illustrating the processes to be 
executed by the supplementary information editing module. 

FIG. 35 is a flow chart illustrating the process of gener- 
ating a format condition. 

FIG. 36 is a flow chart illustrating the processes to be 
executed by a contiguous element checking module. 

FIG. 37 is a diagram showing an example of the results 
processed by the contiguous element checking module. 

FIG. 38 is a block diagram showing the hardware struc- 
ture of the keyword extraction rule generating system of the 
second embodiment. 

DESCRIPTION OF THE PREFERRED 
EMBODIMENTS 

Embodiments of the invention will be described with
reference to the accompanying drawings. In this
embodiment, a structured document generating module analyzes
a document structure through parsing. As the structured
document format, the SGML (Standard Generalized
Markup Language) format is adopted, and as the document
structure definition, the DTD (Document Type Definition) of
SGML is used. The processing
and description rules of SGML and DTD are stipulated
in ISO (International Organization for Standardization) standard
ISO 8879. The details thereof are explained in "SGML:
An Author's Guide to the Standard Generalized Markup
Language", by Martin Bryan, Addison-Wesley Publishers,
1988. In this embodiment, yacc is used as the process of
automatically generating a parsing module. The C language is
used for describing a process to be added when each rule
input to yacc is satisfied. The details of yacc
are explained in "How to Use yacc and lex" by
Takashi SAJTHO, HBJ publishing division, and the C
language is explained in "Programming Language C"
by B. W. Kernighan and D. M. Ritchie, Kyoritsu
Publishing Company.

First, the outline of the first embodiment will be
described. FIG. 18 is a diagram showing the hardware
structure of a structured document generating system of the
first embodiment. An input/display device 1 receives an
input entered by a user and displays an input non-structured
document, a generated structured document, or the like. The
input/display device 1 is constituted by a display, a
keyboard, a mouse, or the like. An external repository unit
2 stores a variety of data for structured document generation.
This unit 2 is realized by a hard disk or the like and is
constituted by a non-structured document repository 21, a
structured document generating rule repository 22, and a
structured document repository 23. A control unit 3 controls
each device constituting the system, processes information
for structured document generation, and is constituted by a
controller 31, an internal memory 32, and a structured
document generating unit 33. The controller 31 reads data
stored in the non-structured document repository 21 and the
structured document generating rule repository 22, loads
it into the internal memory 32, executes processes of the
structured document generating unit 33 on the internal
memory 32 by using the loaded data, and stores the
generated structured document in the structured document
repository 23. The processes to be executed include a
process 34 of generating a parsing module and a process 35
of generating a structured document. The parsing module
generating process 34 constitutes part of the structured
document generating process 35. The structured document
generating process 35 is a process of converting a non-structured
document stored in the non-structured document
repository 21 into a structured document by using a document
structure definition, a keyword extraction rule, a rule
conversion regulation, and the like respectively stored in the
non-structured document repository 21. The parsing module
generating process 34 and the structured document generating
process 35 can be described by known programming
languages.

Next, the outline of processes of the first embodiment will 
be described. 

FIG. 1 is a block diagram showing a flow of the structured
document generating process of the structured document
generating system of the embodiment. A non-structured
document 101 is electronic document information of
sequential character strings generated by a word processor,
a character recognition apparatus, or the like, and is input to
the system from the input/display device 1. A keyword
extraction module 102 extracts keywords from the non-structured
document in accordance with a keyword extraction
rule 103. A keyword is a character string expressing the
document structure of the non-structured document 101. The
keyword extraction module 102 then separates the non-structured
document 101 into keywords and other strings
and generates an abstract keyword/text model 104 as an
aggregation of these elements of keywords and other strings.
A parsing module 105 performs the parsing process described
in a parsing rule 111 to analyze the document structure, the
parsing rule 111 having been generated by a parsing rule
generating module 110.
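
The separation of a document into keywords and other strings described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the two regular expressions in `RULES` are hypothetical stand-ins for the keyword extraction rule 103, and the tuple layout of the model is chosen for this example rather than taken from FIG. 8.

```python
import re

# Hypothetical extraction rules: keyword name -> regular expression on a line.
RULES = [
    ("ARTICLENO", re.compile(r"^\s*ARTICLE \d+\s*$")),
    ("PARAGRAPHNO", re.compile(r"^\s*\(\d+\)\s*$")),
]

def build_model(lines):
    """Split lines into (keyword name, string) and ('TEXT', string) elements.
    Consecutive non-keyword lines are merged into one TEXT element."""
    model = []
    for line in lines:
        name = next((n for n, rx in RULES if rx.match(line)), None)
        if name is not None:
            model.append((name, line.strip()))
        elif model and model[-1][0] == "TEXT":
            model[-1] = ("TEXT", model[-1][1] + " " + line.strip())
        else:
            model.append(("TEXT", line.strip()))
    return model

model = build_model([
    "ARTICLE 1",
    "FLOOD DEFENCE SIGNALS INCLUDE",
    "THE FOLLOWING.",
    "(1)",
    "FIRST SIGNAL",
])
```

Merging adjacent non-keyword lines keeps every TEXT element next to a keyword, which is the shape the parsing step expects.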

The outline of a method of generating the parsing module
105 is as follows. First, a DTD correcting module 107
modifies a DTD 106 to generate a modified DTD 108 so as to
match the description format of the non-structured document
101, and stores difference information as DTD difference
data 109. DTD 106 is a prepared standard document type
definition and does not necessarily match the input non-structured
document 101. This modification is therefore
performed in accordance with the result of a comparison, made
by a system user, between the non-structured document 101 and
DTD 106. The parsing rule generating module 110 refers to
a rule conversion regulation 112 and generates the parsing
rule 111 from the modified DTD 108. Then, yacc 113, which
is the process of generating a parsing module in this
embodiment, generates the parsing module 105 in accordance
with the parsing rule 111, the parsing module 105
realizing the parsing process described by the parsing rule 111.

The parsing module 105 performs a parsing process for 
the keyword/text model 104, and affixes a tag representative 
of the document structure to generate an interim SGML 
document 114. This document is a document instance 
formed in conformity with the modified DTD 108. 
Therefore, by referring to the DTD difference data 109, an 
SGML document correcting module 115 modifies the 
interim SGML document 114 to generate an SGML docu- 
ment 116 matching DTD 106. 
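
The correction step can be pictured with the following Python sketch. It assumes a hypothetical form of difference data consisting of two sets, `drop` (elements added to the modified DTD only to aid parsing) and `unwrap` (wrapper elements whose children are kept); the actual format of the DTD difference data 109 is the one shown in FIG. 11 and may differ. Elements are modelled as `(tag, content)` pairs, which is likewise an assumption of this example.

```python
# Element = (tag, content) where content is a string or a list of Elements.

def correct(element, diff):
    """Apply hypothetical difference records to an interim element tree.
    diff["drop"]: tags removed together with their content.
    diff["unwrap"]: wrapper tags removed while their children are kept.
    Returns a list of elements that replace `element`."""
    tag, content = element
    if tag in diff["drop"]:
        return []
    if isinstance(content, str):
        return [element]
    children = []
    for child in content:
        children.extend(correct(child, diff))
    if tag in diff["unwrap"]:
        return children
    return [(tag, children)]

interim = ("LAW", [
    ("CHANGE", [("OPENINGTITLE", "X REGULATION"), ("TITLE", "X REGULATION")]),
    ("PRESENTREGULATION", [("ARTICLE", [("ARTICLENO", "ARTICLE 1")])]),
])
diff = {"drop": {"OPENINGTITLE"}, "unwrap": {"CHANGE"}}
final = correct(interim, diff)
```

Reordering elements and inserting placeholders for missing ones, as seen when comparing FIG. 15 with FIG. 17, would need further record types that this sketch does not cover.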

Each process of the embodiment will be detailed next. 

FIG. 2 shows an example of the non-structured document
101 shown in FIG. 1. This document is obtained from an
existing paper document regarding a law through
character recognition. Although there is no explicit description
showing the document structure, each component of this
document is laid out so as to be easy to read, using spaces or the
like. In order for the document processing system to utilize
such a text type electronic document, a document type
definition (DTD) is set. FIG. 3 shows an example of DTD
for the non-structured document shown in FIG. 2. The
first line (lines are hereinafter referred to by their
line numbers) indicates that the document
structure definition has a name of "LAW". The second to
seventeenth lines indicate definitions of elements. The name of
an element is described after "!ELEMENT", and after this a
model group is described between "(" and ")". The model
group is an aggregation of constituents which form an
element. These constituents are one or more elements and
content tokens representative of data such as "#PCDATA";
model groups themselves, nested, may also be used
as such constituents. The second line indicates that the
element "LAW" is constituted by a series of elements of
"PROMULGATION", "ESTABLISHEDREGULATIONNO",
"TITLE", and "PRESENTREGULATION". The
third line indicates that the element "PROMULGATION" is
constituted by a series of elements of
"PROMULGATIONSTATEMENT",
"PROMULGATIONDATE", and "PROMULGATIONOFFICER".
The eleventh line indicates that the element
"PRESENTREGULATION" is constituted by one or more
"ARTICLE" elements. An element affixed with "+", such as
"ARTICLE", may occur one or more times.
An element affixed with an asterisk "*" may occur any
number of times, including zero. The content token "#PCDATA" at
the fourth, fifth, and seventh to tenth lines means that the
corresponding elements
"PROMULGATIONSTATEMENT",
"PROMULGATIONDATE", "OFFICIALTITLE",
"NAME", "ESTABLISHEDREGULATIONNO", and
"TITLE" each contain a string indicating the contents of the
element. The document structure in a tree diagram is shown
in FIG. 4.
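
As an illustration of how occurrence indicators in a model group might be turned into grammar rules, the following Python sketch applies the textbook expansion of "+", "*", and "?" into recursive yacc-style rules. The conversion rules actually used by the embodiment are those of FIG. 12; the function `expand`, the auxiliary rule names, and the output format here are assumptions for this example only.

```python
def expand(element, model):
    """Turn a sequence model group such as ["ARTICLE+"] into yacc-style rules.
    Occurrence indicators: "+" = one or more, "*" = zero or more,
    "?" = zero or one. Returns a list of rule strings."""
    rules, rhs = [], []
    for item in model:
        name, mark = (item[:-1], item[-1]) if item[-1] in "+*?" else (item, "")
        if mark == "+":
            aux = name + "_list"
            rules.append(f"{aux} : {name} | {aux} {name} ;")
        elif mark == "*":
            aux = name + "_optlist"
            rules.append(f"{aux} : /* empty */ | {aux} {name} ;")
        elif mark == "?":
            aux = name + "_opt"
            rules.append(f"{aux} : /* empty */ | {name} ;")
        else:
            aux = name
        rhs.append(aux)
    return [f"{element} : {' '.join(rhs)} ;"] + rules

for rule in expand("PRESENTREGULATION", ["ARTICLE+"]):
    print(rule)
```

Left-recursive list rules are used because LALR parser generators such as yacc handle them with a bounded stack.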

In this system, the document structure of a non-structured 
document such as shown in FIG. 2 is analyzed by directly 
using DTD such as shown in FIG. 3 to generate a structured 
document which matches DTD. 

The keyword extraction module 102 shown in FIG. 1 
refers to the keyword extraction rule 103 to extract a 
keyword from the non-structured document 101 and generate
the keyword/text model 104. An example of the keyword
extraction rule 103 is shown in FIG. 5. This rule is an
aggregation of combinations of the name of an element to be 
extracted as the keyword and a layout condition which 
describes information about layout and string used for the 
extraction. In FIG. 5, the first item at each line is the name
of a keyword, and the second and following items are the
layout conditions. FIG. 6 gives an explanation of each
description constituent of the layout condition shown in FIG. 5.
For example, the first line shown in FIG. 5 means that the format
conditions of the keyword "OPENINGTITLE" are that a
character "O" is at the three-space position from the line
head, a string of arbitrary length follows, and the line ends
at a string "LAW" or "REGULATION". The fourth line
means that the format conditions of the keyword
"PROMULGATIONDATE" are that a string "SHOWA" or
"TAISHO" is at any number of spaces from the line
head, followed by INTEGER → "YEAR" → INTEGER →
"MONTH" → INTEGER → "DAY" in this order to end the
line.
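
By way of illustration only, the two layout conditions just described might be approximated with regular expressions as sketched below. The pattern syntax, the names KEYWORD_RULES and match_keywords, and the English rendering of the date items ("YEAR", "MONTH", "DAY") are assumptions made for this sketch; the actual rule notation is the one defined in FIG. 5 and FIG. 6.

```python
import re

# Hypothetical regular-expression rendering of two keyword extraction rules.
# Each entry pairs a keyword name with a layout condition.
KEYWORD_RULES = [
    # "O" at the three-space position, any string, line ends with LAW or REGULATION.
    ("OPENINGTITLE", re.compile(r"^ {3}O.*(?:LAW|REGULATION)$")),
    # Era name after any number of spaces, then INTEGER YEAR INTEGER MONTH INTEGER DAY.
    ("PROMULGATIONDATE",
     re.compile(r"^ *(?:SHOWA|TAISHO) *\d+ *YEAR *\d+ *MONTH *\d+ *DAY$")),
]


def match_keywords(line):
    """Return the names of all rules whose layout condition the line satisfies."""
    return [name for name, pattern in KEYWORD_RULES if pattern.match(line)]
```

A line may satisfy more than one rule, which is why the sketch returns a list of names rather than a single name.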

The keyword extraction module 102 shown in FIG. 1 
checks whether the electronic document contains a string
which matches the format conditions of the keyword
extraction rule. If there is a matching string, it is extracted
as the keyword (an example of an extracted keyword is
shown in FIG. 7). Thereafter, the document is separated into
keywords and other strings to generate the abstract
keyword/text model 104 which is an aggregation of keywords and
other strings. Specifically, if there is a string which is not a
keyword, between keywords, it is considered to be a "text"
string other than keywords, and a keyword/text model such
as shown in FIG. 8 is configured. The keyword/text model
shown in FIG. 8 starts from the keyword
"OPENINGTITLE", followed by a keyword
"PROMULGATIONDATE" → a keyword
"ESTABLISHEDREGULATIONNO." → a keyword
"PROMULGATIONSTATEMENT" → a keyword "TITLE"
→ a keyword "ARTICLENO.". Since a string which is not a
keyword is sandwiched between the keyword "ARTICLENO."
and the next keyword "PARAGRAPHNO.", this
string is considered as a text.
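
The separation into keywords and texts may be pictured with the following minimal sketch. The two rules, the token tuple layout, and the merging of consecutive non-keyword lines into one text token are assumptions introduced for illustration; they are not taken from FIG. 5 or FIG. 8.

```python
import re

# Hypothetical, simplified keyword rules (not the rules of FIG. 5).
RULES = [
    ("ARTICLENO.", re.compile(r"^ARTICLE \d+$")),
    ("PARAGRAPHNO.", re.compile(r"^\(\d+\)$")),
]


def build_model(lines):
    """Build a keyword/text model: a flat list of KEY and TEXT tokens."""
    model = []
    for line in lines:
        name = next((n for n, p in RULES if p.match(line)), None)
        if name is not None:
            model.append(("KEY", name, line))
        elif model and model[-1][0] == "TEXT":
            # Assumption: consecutive non-keyword lines form one text token.
            model[-1] = ("TEXT", None, model[-1][2] + "\n" + line)
        else:
            model.append(("TEXT", None, line))
    return model
```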

There is a case wherein the same string in the same region
of the document is extracted as a plurality of keywords. For
example, in the example of the extracted keywords shown in
FIG. 7, the string
"OAAPREFECTUREFLOODDEFENCESIGNALREGULATION"
at the first and second lines is extracted as the keyword
of both of the keyword names "OPENINGTITLE" and "TITLE".
In such a case, it is assumed that the keywords are
extracted from the same region, and a plurality of
keyword/text models, one corresponding to each keyword,
are generated. The keyword/text model shown in FIG. 8 is
formed by selecting "OPENINGTITLE" from the
region-conflicting keyword names "OPENINGTITLE" and
"TITLE". Of the plurality of keyword/text models, a model
which the parsing module 105 fails to parse is determined
as an improper keyword/text model. If there is a plurality
of keyword/text models which succeeded in the parsing, an
optimum one is selected in accordance with a criterion such
as the number of extracted keywords, so that a single SGML
document is eventually generated from the optimum
keyword/text model.
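
A minimal sketch of this selection is given below, assuming that each region of the document is represented by the list of token names it could yield and that a predicate `parses` stands in for the parsing module 105. The function names and the data layout are hypothetical, and the "largest number of keywords" criterion is only the example criterion named above.

```python
from itertools import product


def candidate_models(regions):
    """One candidate keyword/text model per combination of conflicting names."""
    return [list(choice) for choice in product(*regions)]


def select_model(regions, parses):
    """Drop models that fail to parse, then prefer the one with most keywords."""
    succeeded = [m for m in candidate_models(regions) if parses(m)]
    if not succeeded:
        return None
    return max(succeeded, key=lambda m: sum(1 for t in m if t != "TEXT"))
```

For instance, a first region offering both "OPENINGTITLE" and "TITLE" yields two candidate models, of which only those accepted by `parses` remain eligible.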

The parsing module 105 shown in FIG. 1 performs a
parsing process for the keyword/text model 104 in accordance
with the parsing rule 111. First, the processes of
modifying DTD 106 by the DTD correcting module 107 and
generating the parsing rule 111 will be described with
reference to FIG. 9.

First, the DTD correcting module 107 manually generates 
a modified DTD 108 by modifying the description contents
of DTD 106 set for the non-structured document so as to
match the description format of the non-structured
document, and stores the difference as the DTD difference
data 109. Such correction becomes necessary because the
description items and their order may differ between the
non-structured document 101 and DTD 106 used for this
system. For example, although DTD 106 shown in FIG. 3 is
prepared for the non-structured document 101 shown in
FIG. 2, the element for the opening title "OAA PREFECTURE
FLOOD DEFENCE SIGNAL REGULATION" at the first line shown
in FIG. 2 is not given in DTD 106 shown in FIG. 3. In DTD
106 shown in FIG. 3, elements are disposed in the order of
"PROMULGATIONSTATEMENT → PROMULGATIONDATE →
ESTABLISHEDREGULATIONNO → TITLE", whereas in the
non-structured document shown in FIG. 2, the elements are
disposed in the order of "PROMULGATIONDATE →
ESTABLISHEDREGULATIONNO. → PROMULGATIONSTATEMENT →
TITLE".

In order to eliminate such contradiction, the modified 
DTD 108 shown in FIG. 10 is manually generated. The 
meshed portion in FIG. 10 shows the modified elements. In 
order to explicitly indicate the modified portion, this portion
is enclosed in an element <CHANGE>. The modified
portion of the original DTD 106 is stored as the DTD
difference data 109 such as shown in FIG. 11. Also in this
case, the modified portion is enclosed in the element
<CHANGE>.

If there is no contradiction of the document structure 
between the non-structured document and DTD 106, it is not 
necessary to generate the modified DTD 108 and DTD 
difference data 109. 

After DTD 106 is modified where necessary, the parsing 
rule generating module 110 executes a rule conversion 
process 906 in accordance with the rule conversion regula- 
tion 112 shown in FIG. 12 to convert the element definition 
described in the modified DTD 108 into an interim yacc rule 
908. Each interim rule (hereinafter called a "production
rule") is constituted by left and right sides partitioned
by a colon ":" such as "A : B C;". If the pattern
described at the right side is present, the rule is satisfied and
the element at the left side is configured. In this example of
the production rule of "A : B C;", an element A is generated
if a pattern "B C" is present.

In DTD, the production rule having the right side of 
"#PCDATA" means that the left side element corresponds 
directly to the string of the document structure analysis 
result. In converting the production rule into the interim yacc 
rule, if the left side element is an element extracted as a 
keyword in accordance with the keyword extraction rule 
shown in FIG. 5, then #PCDATA is converted into [#KEY
"(KEYWORDNAME)"]. #PCDATA in the other production
rules is converted into "#TEXT" meaning a string other than
the keyword. For example, the production rule converted 
into [OPENINGTITLE: #KEY "OPENINGTITLE"] indi- 
cates that the keyword "OPENINGTITLE" corresponds to 
the element "OPENINGTITLE". The production rule con- 
verted into [ARTICLESTATEMENT: #TEXT] indicates that 
a string other than the keyword corresponds to the element 
"ARTICLESTATEMENT". 

FIG. 13 shows an example of the yacc rule converted 
from the modified DTD shown in FIG. 10. For example, the 
definition at the fifth line shown in FIG. 10 is converted into 
the production rules at the fourth and fifth lines shown in FIG.
13. In this case, the "PROMULGATIONSTATEMENT?"
shown in FIG. 10 is converted into "opt0" at the fourth line
shown in FIG. 13 in accordance with the second rule from the
bottom shown in FIG. 12. The definition of "opt0" is described
at the fifth line of FIG. 13.
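
The two conversions described above (#PCDATA into #KEY or #TEXT, and an optional element into a separately defined non-terminal) may be sketched as follows. This is an illustration under stated assumptions, not the rule conversion regulation of FIG. 12: the function names, the input representation, the numbering of the generated non-terminals, and the use of an empty alternative to express optionality (a common yacc idiom) are all hypothetical.

```python
def convert_pcdata(element, keyword_names):
    """Convert 'element: #PCDATA' into an interim production rule."""
    if element in keyword_names:
        # The element is extracted as a keyword by the keyword extraction rule.
        return f'{element}: #KEY "{element}"'
    # Otherwise the element corresponds to a string other than a keyword.
    return f"{element}: #TEXT"


def expand_optional(lhs, items, start=0):
    """Replace each 'X ?' item by a fresh non-terminal and emit its definition."""
    rhs, extra = [], []
    for item in items:
        if item.endswith("?"):
            name = f"opt{start + len(extra)}"
            rhs.append(name)
            # Assumed idiom: an optional element is either absent or present once.
            extra.append(f"{name}: /* empty */ | {item[:-1].strip()};")
        else:
            rhs.append(item)
    return [f"{lhs}: {' '.join(rhs)};"] + extra
```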

If such an interim yacc rule is used, the parsing module
generated by yacc outputs only a success/failure of parsing 
and does not output the correspondence between the 
keyword/text model and elements. However, in order to 
generate the structured document by using the results of 
parsing, it becomes necessary, when each element analysis 
succeeds, i.e., when each interim rule is satisfied, to add, to 
the keyword/text model, information (hereinafter called "tag 
information") indicating which element corresponds to each 
constituent of the keyword/text model. To this end, the 
parsing rule generating module 110 executes a C language 
program embedding process 909 for the interim yacc rule 
908 in order to add the tag information to the keyword/text 
model and generate the parsing rule 111. An example of the 
parsing rule 910 is shown in FIG. 14. The meshed portions 
illustrate the process of the embedded C language programs. 
In this process, pieces of tag information corresponding to 
the right side elements of the production rule are coupled 
and the tag information corresponding to the left side 
elements of the production rule is generated. 

Referring back to FIG. 1, yacc 113 receives the generated 
parsing rule 111 and generates a parsing module 105 which 
performs a parsing process in accordance with the parsing 
rule 111. Manual operation required during the process of 
generating the parsing module 105 from DTD 106 is only 
the operation of changing the document structure definition 
so as to match the description format of the non-structured 
document and generating the DTD difference data 109. The 
other operations are automatically performed. 

The parsing module 105 analyzes the document structure 
for the keyword/text model 104 to verify whether the 
keyword/text model 104 matches the parsing rule 111, and 
adds the tag information representative of the document 
structure detected during this process to the keyword/text 
model 104. The interim SGML document 114 is generated
from the keyword/text model to which the tag information
has been added.

Keywords and texts (hereinafter collectively called
"tokens") of the keyword/text model both correspond to
"#PCDATA" in DTD of the tree diagram shown in FIG. 4,
i.e., to the string representing the contents of each element.
A keyword is a string in one-to-one correspondence with
an element, whereas a text is a string not yet placed in
correspondence with any element. The parsing process
corresponds to generating the tree structure shown in FIG. 4
from the one-dimensional arrangement of keywords and
texts, i.e., the keyword/text model.

The outline of this process by the parsing module 105 is 
illustrated in FIG. 19. The parsing module 105 generated by 
yacc 113 is constituted by a state transition table 2004 and 
a parser 2003 which performs the parsing process while 
referring to the state transition table 2004. Described in the
state transition table 2004 are the tokens acceptable in each
state of parsing, and information on the state to which the
parsing changes when a token is accepted. The parser 2003
sequentially reads tokens starting from the opening token,
the tokens being constituents of the keyword/text model
2001 (2005). If it is judged in a certain state that the input
token cannot be accepted, it is judged that parsing failed
(2006→2007). Conversely, if acceptable, the state of parsing
advances one step in accordance with the state transition
table (2006→2008). In this state, if any one of the production
rules of the parsing rule 111 can be satisfied, the tag
information corresponding to the production rule is added to
the keyword/text model 2001 (2009→2010: this process is
realized by the inserted programs shown in FIG. 14).
Specifically, if a single token corresponds to a certain
element, start-tag information and end-tag information
representative of the name of the element are added to the token
as a pre-tag and a post-tag. For an element corresponding
to a plurality of tokens, the start-tag information and end-tag
information are added to the start and end tokens,
respectively. The details of adding tag information will be
described later.

When the last token is input and if the parsing changes to
the state of "normal termination", it is judged that the
document structure analysis of the keyword/text model has
succeeded.

The process when a production rule is satisfied during the 
parsing will be detailed with reference to the keyword/text 
model shown in FIG. 8 and the rule shown in FIG. 13. This 
process realizes the following two functions. 

(1) To what element a keyword or text corresponds is
determined. For example, if the keyword
"ARTICLENO." at the sixth line of the keyword/text
model shown in FIG. 8 is input, the production rule at the
thirteenth line of FIG. 13 is satisfied (which production
rule is satisfied in a certain state is described in the state
transition table 2004), and the keyword "ARTICLENO."
corresponds to the element "ARTICLENO.". In this case,
the start-tag information and end-tag information of
"ARTICLENO." are added to the pre-tag and post-tag of
the keyword "ARTICLENO." of the keyword/text model
(seventeenth and eighteenth lines in FIG. 20). Next, when
the text at the seventh line of FIG. 8 is input, the
production rule at the fourteenth line of FIG. 13 is
satisfied so that this text is considered to correspond to the
element "ARTICLESTATEMENT". The start-tag information
and end-tag information of "ARTICLESTATEMENT" are
added to the pre-tag and post-tag of the text (twenty-first
and twenty-second lines in FIG. 20).

(2) Adjacent elements are summarized to a more abstract
element.

For example, in FIG. 4, the adjacent elements
"PARAGRAPHNO." and "PARAGRAPHSTATEMENT" are
summarized to a more abstract "PARAGRAPH". In the example
of the keyword/text model shown in FIG. 8, the adjacent
"PARAGRAPHNO." and the text (corresponding to
"PARAGRAPHSTATEMENT") at the eighth and ninth lines
are summarized to the one element "PARAGRAPH" in
accordance with the production rule at the sixteenth line of
FIG. 13. If this production rule is satisfied, the start-tag
information of "PARAGRAPH" is added to the keyword
"PARAGRAPHNO." at the eighth line of FIG. 8, and the
end-tag information is added to the text at the ninth line
(twenty-fourth and twenty-eighth lines in FIG. 20). The same
operation is performed for the combinations of tenth and
eleventh lines, twelfth and thirteenth lines, and fourteenth
and fifteenth lines in FIG. 8.

The adjacent "ARTICLENO." (sixth line) and
"ARTICLESTATEMENT" (seventh line) and a plurality of
"PARAGRAPH" elements (eighth to fifteenth lines) can be
summarized to the element "ARTICLE" in accordance with the
production rules at the twelfth and fifteenth lines in FIG. 13.
In this case, the start-tag information of "ARTICLE" is
added to the pre-tag of the keyword "ARTICLENO." at the
sixth line, and the end-tag information is added to the
post-tag of the text at the fifteenth line (in FIG. 20, only the
addition of the start-tag information of "ARTICLE" is
illustrated at the seventeenth line).
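
The addition of pre-tags and post-tags under functions (1) and (2) may be sketched as below. The token representation (a dictionary with "pre", "string", and "post" entries) and the function names are assumptions made for this illustration; the embedded C language programs of FIG. 14 are not reproduced here.

```python
def new_token(string):
    """A token of the keyword/text model with empty tag information."""
    return {"pre": [], "string": string, "post": []}


def add_tags(tokens, first, last, element):
    """Record that tokens[first..last] together form `element`."""
    # Inner elements are reduced first, so an enclosing element must open
    # before the tags already attached and close after them.
    tokens[first]["pre"].insert(0, f"<{element}>")
    tokens[last]["post"].append(f"</{element}>")


def render(tokens):
    """Emit each string with its tags placed in front of and behind it."""
    return "".join(
        "".join(t["pre"]) + t["string"] + "".join(t["post"]) for t in tokens
    )
```

A single-token element uses the same index for `first` and `last`, while a summarized element such as "ARTICLE" spans from its first token to its last.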

If the elements are summarized whose constituents are
keywords representing a number such as "ARTICLE" and
"PARAGRAPH" (in this case, "ARTICLENO." and
"PARAGRAPHNO."), the first number and the continuity
between numbers are checked. Namely, it is checked
whether the number begins with "1" and thereafter the
numbers 1, 2, 3, . . . are continuous.

The above process is sequentially performed for each input
token of the keyword/text model 104. If the tree structure
shown in FIG. 4 having one root (in the example shown in
FIG. 4, "LAW") can be obtained, it is judged that the
keyword/text model 104 matches the parsing rule 111 and
the parsing has succeeded. Conversely, if a token input in a
certain state during the parsing is not acceptable, i.e., if the
keyword/text model 104 does not match the parsing rule 111,
it is judged that the parsing has failed. If in the continuity
check of numbers of the function (2) described above, the
first number is abnormal or the continuity between numbers
is not retained, it is judged that the document structure
analysis has failed. Examples of such cases are numbering
that starts from the number 3 instead of the number 1, and
numbers that are skipped as in 1, 2, and 5.
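
The continuity check may be expressed compactly as in the following sketch, which assumes that the numbers have already been extracted from the "ARTICLENO." or "PARAGRAPHNO." keywords as integers; the function name is hypothetical.

```python
def numbers_are_continuous(numbers):
    """True only if the sequence starts at 1 and increases by 1 each time."""
    return list(numbers) == list(range(1, len(numbers) + 1))
```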

If the parsing has succeeded, the parsing module 105
outputs the interim SGML document 114 in accordance with
the tag information given to the keyword/text model 104.
Specifically, in the output interim SGML document 114,
tags corresponding to the start-tag information and end-tag
information are added to the front and back of the string
corresponding to each token of the keyword/text model 104.
An example of the interim SGML document 114 is shown in
FIG. 15.

As seen from this example, the tag information includes
the start-tag information and end-tag information, and the
end-tag information is not always positioned near the
start-tag information. For example, although the end-tag
information </ARTICLENO.> for the start-tag information
<ARTICLENO.> is just two lines below, the end-tag
information </ARTICLE> for the start-tag information
<ARTICLE> is far below, outside the drawing space.
Therefore, if the document structure were to be manually
modified after the interim SGML document is generated, it
would be required to search for the corresponding start-tag
information and end-tag information over the whole of the
document, requiring a large amount of labor. In this
embodiment, necessary modification is completed at the
stage of DTD so that the generated interim SGML document
114 matches the input non-structured document 101 and the
modification described above is not necessary.

If a plurality of keywords are extracted from the same
region, a plurality of keyword/text models are generated. In
this case, the parsing process is performed for all the
keyword/text models. If an erroneous keyword is contained,
the parsing fails. If there are a plurality of keyword/text
models which have succeeded in the parsing, an optimum
keyword/text model is selected in accordance with, for
example, the condition that there are a large number of
extracted keywords, and a corresponding interim SGML
document is output. This will be described by using an
example shown in FIG. 7 in which two keywords
"OPENINGTITLE" and "TITLE" are extracted from the same
string of the non-structured document. The keyword/text
model generated by selecting the "TITLE" fails in the
parsing because the first line in the modified portion of the
modified DTD stipulates that the "OPENINGTITLE" can
appear at the top of the "LAW" but the "TITLE" cannot
appear at the top of the "LAW". Therefore, the interim
SGML document for the keyword/text model generated by
selecting the "TITLE" is not output. On the other hand, the
keyword/text model generated by selecting the
"OPENINGTITLE" succeeds in the parsing, and the corresponding
interim SGML document is output as shown in FIG. 15.

If there is the DTD difference data 109, the SGML
document correcting module 115 modifies the interim
SGML document 114 in accordance with the DTD difference
data. The contents of a particular process will be
described with reference to FIG. 16. The SGML document
correcting module 115 generates an instance 1602 of modified
part in DTD, which is a partial SGML document
corresponding to the contents described in the DTD difference
data 109. In this case, a string "#PCDATA" representing
the contents of the document structure is required to be
replaced by a corresponding string. A change module 1603
for the interim SGML document replaces the string by
another string representative of the contents of the element
having the same name. For example, the "#PCDATA"
sandwiched between the two tags <PROMULGATIONSTATEMENT>
and </PROMULGATIONSTATEMENT>
in the instance 1602 of modified part in DTD is replaced by
a string
"AAPREFECTUREFLOODDEFENCESIGNALREGULATIONISTOBEPROMULGATEDASINTHEFOLLOWING"
sandwiched between the same
tags, in the changes 1603 in the interim SGML document.
Similarly, the "#PCDATA" sandwiched between the two
tags <PROMULGATIONDATE> and
</PROMULGATIONDATE> is replaced by a string
"SHOWA 24, OCTOBER, 6", and the "#PCDATA"
sandwiched between the two tags <ESTABLISHEDREGULATIONNO.>
and </ESTABLISHEDREGULATIONNO.> is
replaced by a string "AAPREFECTUREREGULATIONNO.78".
In a case such as the "#PCDATA" sandwiched
between the two tags <OFFICIALTITLE> and
</OFFICIALTITLE> in the instance 1602 of modified part
in DTD, for which no element having the same name is
included in the changes 1603 in the interim SGML
document, a string "NONE" is forcibly inserted.

The instance 1602 of modified part in DTD generated by
the replacement process is substituted for the modified portion
of the interim SGML document 114 of FIG. 1, i.e., in the
example shown in FIG. 15, the portion sandwiched between
the two tags <CHANGE> and </CHANGE>. In this
manner, the SGML document matching DTD 106 preset for 
subject documents can be generated. An example of the 
SGML document 116 is shown in FIG. 17. Since the 
individual document structure is directly reflected upon the 
SGML document, it is not necessary as in the conventional 
case to convert the document instance into the individual 
document structure. 
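
The correction performed by the SGML document correcting module 115 may be sketched as follows, under the simplifying assumptions that both the interim SGML document and the instance of modified part in DTD are held as plain strings, that each element name occurs once in the changed portion, and that every end tag is written out. The function names are hypothetical, and the regular-expression handling is an illustrative substitute for a real SGML processor.

```python
import re


def fill_instance(instance, changed_portion):
    """Replace each #PCDATA by the content of the same-named element."""
    def replace(match):
        name = match.group(1)
        found = re.search(
            rf"<{re.escape(name)}>(.*?)</{re.escape(name)}>",
            changed_portion,
            re.S,
        )
        # "NONE" is inserted when no element of the same name exists.
        content = found.group(1).strip() if found else "NONE"
        return f"<{name}>{content}</{name}>"

    return re.sub(r"<([^<>/]+)>#PCDATA</\1>", replace, instance)


def correct(interim, instance):
    """Substitute the filled instance for the <CHANGE>...</CHANGE> portion."""
    change = re.search(r"<CHANGE>(.*?)</CHANGE>", interim, re.S)
    if change is None:
        return interim  # no DTD difference to undo
    filled = fill_instance(instance, change.group(1))
    return interim[:change.start()] + filled + interim[change.end():]
```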

Programs realizing the first embodiment may be stored in 
a storage device such as a hard disk, a floppy disk, or an
optical disk.

According to the first embodiment described above, the 
parsing rule 111 used for the document structure analysis is 
directly generated from the document structure definition set 
for subject documents. It is therefore possible to reduce 
labor required for the generation of a rule. Since the docu- 
ment instance is generated through parsing in accordance 
with the document structure described in the document 
structure definition of each document, it is not necessary to 
convert the document instance obtained through parsing, 
from the format matching the common document structure 
into the format matching the individual document structure. 

Next, the second embodiment will be described. This 
embodiment pertains to a method of supporting generation of
the keyword extraction rule 103 by using the modified DTD
and given layout information.

Similar to the first embodiment, also in this second
embodiment, an SGML format is adopted as an example of
the structured document format, and as the document
structure definition, a DTD is used which is a document type
definition for SGML set for subject documents.

FIG. 38 is a diagram showing the hardware structure of a
keyword extraction rule generating system of the second
embodiment. An input/display device 3910 receives an input
entered by a user and displays information about layout,
a generated keyword extraction rule, or the like. The
input/display device 3910 is constituted by a display, a
keyboard, a mouse, or the like. An external repository unit
3920 stores a variety of data for keyword extraction rule
generation. This unit 3920 is realized by a hard disk or the
like and is constituted by a modified DTD repository 3921,
a layout definition repository 3922, a string-corresponding
element information repository 3923, a layout information
repository 3924, and a keyword extraction rule repository
3925. A control unit 3930 controls each device constituting
the system, processes information for keyword extraction
rule generation, and is constituted by a controller 3931, an
internal memory 3932, and a keyword extraction rule
generating module 3933. The controller 3931 reads data
stored in the modified DTD repository 3921 and layout
definition repository 3922, develops it on the internal
memory 3932, executes processes of the keyword extraction
rule generating module 3933 on the internal memory 3932 by
using the developed data, and stores the generated
string-corresponding element information and layout
information respectively in the string-corresponding element
information repository 3923 and layout information
repository 3924. The processes to be executed include a
process 3934 of extracting document structure information
and a process 3935 of extracting layout information. A
process 3936 of generating a keyword extraction rule
notifies an operator via the input/display device 3910 of the
string-corresponding element information stored in the
string-corresponding element information repository 3923
and the layout information stored in the layout information
repository 3924, and receives, if necessary, supplementary
information from the operator via the input/display device
3910. The process 3934 of extracting document structure
information, the process 3935 of extracting layout
information, and the process 3936 of generating a keyword
extraction rule can be described by known programming
languages.

Next, the outline of processes of the second embodiment
will be described.

FIG. 21 is a block diagram showing a flow of the keyword
extraction rule generating system. Reference numeral 2201
represents a modified DTD (same as DTD 108 shown in
FIG. 1) obtained by modifying the document structure
definition set for subject documents so as to match an input
non-structured document. The modified DTD 2201 defines
elements of the non-structured document and the
relationship between elements. A document structure
information extracting module 2202 refers to the modified
DTD 2201 and generates string-corresponding element
information 2203 describing elements in direct
correspondence with a string (hereinafter called a
"string-corresponding element") and a contiguity
relationship between elements.

Reference numeral 2204 represents a layout definition set
for subject documents which defines with what layout each
element is output. A layout information extracting module
2205 refers to the layout definition 2204 and extracts as many
as possible of the items necessary for generating a keyword
extraction rule from the layout used for outputting each
element and from the information of an output string. Each
item itself is hereinafter called a "required item", and the
information extracted for each item is called a "required
item content". Layout information 2206 describes the
required item content for each string-corresponding element.

A keyword extraction rule generating module 2207
informs via an input/display device 2211 an operator of the
required item content for each string-corresponding element
in the layout information 2206. This module 2207 receives
information entered by the operator, modifies the required
item content, and generates a keyword extraction rule 2212
in accordance with the modified required item content.

The process by the keyword extraction rule generating
module 2207 will be described more particularly. A keyword
information indicator module 2208 informs the operator of
the name of a string-corresponding element described in the
string-corresponding element information 2203. If a
string-corresponding element is set as a
keyword-corresponding element and given a format
condition, this format condition is also displayed together
with the string-corresponding element.

A supplementary information editing module 2209 sets
the format condition of each string-corresponding element.
The supplementary information editing module 2209 refers
to the layout information 2206 and displays the required
item content of the string-corresponding element selected by
the operator. If the displayed required item content is
different from the layout and strings of the non-structured
document, the operator corrects it. The content of the
required item is given by the operator if it cannot be
extracted by the layout information extracting module 2205.
In this manner, all the required item contents are edited so
that they match the layout and strings of the non-structured
document. After all the required items are edited, the
supplementary information editing module 2209 generates the
format condition used for keyword extraction by using the
required item contents. By using the layout condition as a
return argument, the process is passed to the keyword
information indicator module 2208.

The keyword information indicator module 2208 sets as
the keyword-corresponding element the
string-corresponding element whose format condition was
generated by the supplementary editing module 2209, and
displays the layout condition together with the element name.

With the above processes, each keyword-corresponding
element is determined. A contiguous element checking
module 2210 inspects at a certain timing whether an
aggregation of keyword-corresponding elements satisfies the
restriction condition that non-keywords should not be
contiguous. The contiguous element checking module 2210
refers to the contiguity relationship between
string-corresponding elements described in the
string-corresponding element information 2203, and inspects
whether string-corresponding elements other than the
keyword-corresponding elements (hereinafter called
"non-keyword-corresponding elements") are contiguous. If
there is a possibility that two non-keyword-corresponding
elements are contiguous, the operator generates the layout
condition of one of the two elements and sets it as the
keyword-corresponding element. Conversely, if there is no
possibility that non-keyword-corresponding elements are
contiguous, the keyword-corresponding elements are
sufficient at this timing. At this time, an aggregation of
combinations of the name of each keyword-corresponding
element and its format condition is used as the keyword
extraction rule 2212.

The outline process of the keyword extraction rule
generating system has been described above. Next, the
details of each process executed by the system shown in
FIG. 21 will be described.

The document structure information extracting module
2202 refers to the modified DTD 2201 such as shown in FIG.
10, extracts each string-corresponding element and
contiguity possibility information between
string-corresponding elements, and outputs them as the
string-corresponding element information 2203.

The string-corresponding element is an element having
"#PCDATA" representative of a string of the document type
definition (modified DTD) as a constituent of the model
group. FIG. 22 shows the string-corresponding elements of
the modified DTD shown in FIG. 10. In the example shown
in FIG. 22, extracted as the string-corresponding elements
are "OPENINGTITLE", "PROMULGATIONDATE",
"ESTABLISHEDREGULATIONNO.",
"PROMULGATIONSTATEMENT", "TITLE",
"ARTICLENO.", "ARTICLESTATEMENT",
"PARAGRAPHNO.", and "PARAGRAPHSTATEMENT".

The document structure information extracting module
2202 checks a possibility of contiguous
string-corresponding elements. The following two specific
processes are performed.

(1) An aggregation of string-corresponding elements at the
start and end of each element is obtained. For example, in
the structured document shown in FIG. 15, at the start of
the element "PROMULGATION" (1501 to 1506), the
string-corresponding element "PROMULGATIONDATE"
(1502 to 1503) appears, and at the end of the
element "PROMULGATION", the string-corresponding
element "PROMULGATIONSTATEMENT" (1504 to
1505) appears. In this process, the elements capable of
appearing at the start and end of each element are derived
from the modified DTD 2201 such as shown in FIG. 10.

In order to obtain an aggregation of string-corresponding
elements capable of appearing at the start of each element,
the procedure A is executed by using the element as the
argument (nt in FIG. 24).

In the procedure A, First[nt] is set to an empty aggregation
(2501), First[nt] representing an aggregation of
string-corresponding elements capable of appearing at the
start of nt. In the nt content model, of the element groups
partitioned by an OR-connector "|", the first element group
is substituted into the variable mg (2502). If the
OR-connector does not exist, the whole of the content model
is substituted into the variable mg. The first element of mg
is substituted into the variable elem (2503). Next, it is
checked whether elem is a string-corresponding element
(2504). If elem is a string-corresponding element, elem is
added to First[nt] (2505) and the flow advances to step 2509,
whereas if not, the content of First[elem] is added to
First[nt] (2508) if First[elem] has been set (2506) and the
flow advances to step 2509. If First[elem] is not set at step
2506, elem is used as the argument and the procedure A is
recursively executed (2507). The return argument, i.e., the
content of First[elem] is added to First[nt] and the flow
advances to step 2509.

At step 2509, it is checked from the content model of nt
whether mg is the last element group partitioned by the
OR-connector. If not, the next element group is substituted
into the variable mg (2510) and the flow returns to step 2503.
If mg is the last element group, by using First[nt] as the
return argument, the processing is passed to the procedure
which called this procedure A (2511).

(2) A combination of contiguous elements in the model 30 The procedure shown in FIG. 24 is performed until 
group of the modified DTD is obtained. There is a Firstfnt] is set for all elements. In this manner, an aggrega- 
contiguity possibility of each combination between the tion of string-corresponding elements capable of appearing 
string-corresponding elements capable of appearing at the at the start of each element can be obtained. In order to 
end of the preceding element and at the start of the obtain an aggregation Lastf ] of string-corresponding ele- 
succeeding element. 35 ments capable of appearing at the end of each element can 
In this embodiment, in order to facilitate the execution of be obtained in the similar manner as the procedure shown in 

these two processes, the modified DTD such as shown in FIG. 24 by replacing the factors shown in FIG. 24 by the 

FIG. 10 is converted to have notation of BNF (Buckus Naur following two factors. 

Form). This conversion procedure conforms with the rule (a) Firstfxxx] in FIG. 24 is replaced by Lastfxxx], 

conversion regulation 112 (FIG. 12) and is generally the 40 (b) The first element at step 2503 is replaced by the last 

same as the procedure of converting the moditied DTD 108 element. 

into the interim yacc rule 908. However, in this embodiment, FIG. 25 shows Firstf ] and Lastf ] of the aggregations of 

which element is determined as a keyword is not known. string-corresponding elements capable of appearing at the 

Therefore, the description U #PCDATA" of the modified start and end of each element of the modified DTD shown 

DTD is not converted into the description of [#KEY 45 in FIG. 10. 

"ARTICLENO."] or [#TEXT], Only in this point, this With the above procedures, it becomes possible to obtain 

embodiment differs from the rule conversion process 906. the aggregation Firstf ] of string-corresponding elements 

FIG. 23 shows an example of the modified DTD capable of appearing at the start of each element and the 

expressed by BNF notation. Also in this embodiment, a rule aggregation Lastf ] of string-corresponding elements 

described in BNF notation and obtained by converting the 50 capable of appearing at the end of each element, 

definition of each element of the modified DTD is called a Next obtained is a combination of contiguous elements in 

"production rule". The right side of each production rule, in the content model of the document structure definition, 

this embodiment, is called a "content model" of the left side There is a contiguity possibility of each combination 

element. between component of Lastf ] of a preceding element and a 

The procedure of obtaining from the modified DTD 55 component of Firstf ] of a succeeding element. An example 
expressed by BNF notation an aggregation of string- of this process is illustrated in FIG. 26 in which the 
corresponding elements at the start and end of each element, production rule "CHANGE: OPENINGTITLEPROMUL- 
will be described. The algorithm of this procedure is shown GATIONTITLE" 2402 shown in FIG. 23 is processed. In 
in FIG. 24. The procedure starting from A in FIG. 24 uses as this production rule of the content model of the element 
an input argument an element, and as a return argument an 60 "LAW", the elements "OPENINGTITLE" and "PROMUL- 
aggregation of string-corresponding elements capable of GATION" are contiguous and the elements "PROMULGA- 
appearing at the start of the element, and contains a recursive TION" and "TITLE" are contiguous (2701). Therefore, the 
call. The variables mg and elem used in this procedure are element in FirstfPROMULGATION] can be backward con- 
local variables newly generated each time the procedure tiguous with the element in Lastf OPENINGTITLE] (2702). 
advances to A. Firstfxx] is a global variable representative of 65 Namely, the string-corresponding element "PROMULGA- 
an aggregation of string-corresponding elements capable of TIONDATE" can be backward contiguous with the string- 
appearing at the start of the element xx. corresponding element "OPENINGTITLE" (2704). The ele- 
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ment in First[TITLE] can be backward contiguous with the 
element in Last[ PROMULGATION] (2703). Namely, the 
string-corresponding element "TITLE" can be backward 
contiguous with both the string-corresponding elements 
"PROM ULGATI ONSTATEM ENT" and " ESTABLISH ED- 
REG UL ATI ONNO." (2705). This process is applied to all 
production rules in the document structure definition 
expressed in BNF notation. Therefore, an aggregation of all 
string-corresponding elements capable of being backward 
contiguous can be obtained, and this aggregation is the 
string-corresponding element information (2203 in FIG. 21). 
An example of the string-corresponding element informa- 
tion 2203 is shown in FIG. 27. 

With the procedure described with the drawings up to 
FIG. 26, the document structure information extracting 
module 2202 can generate the string-corresponding element 
information 2203. 
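The First[ ]/Last[ ] procedure and the contiguity step can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the grammar fragment is hypothetical (it is not the content of FIG. 23), a production rule is modelled as a list of element groups, and any element without a production rule of its own is treated as a string-corresponding element. The names `edge_sets` and `contiguous_pairs` are invented for this example.

```python
# Hypothetical BNF fragment: element -> list of element groups (alternatives
# separated by the OR-connector). Elements that have no rule of their own are
# treated here as string-corresponding elements (an assumption of this sketch).
GRAMMAR = {
    "CHANGE": [["OPENINGTITLE", "PROMULGATION", "TITLE"]],
    "PROMULGATION": [
        ["PROMULGATIONDATE", "ESTABLISHEDREGULATIONNO.", "PROMULGATIONSTATEMENT"],
        ["PROMULGATIONDATE", "ESTABLISHEDREGULATIONNO."],
    ],
}


def edge_sets(grammar, pick):
    """Compute First[] (pick=0) or Last[] (pick=-1) for every element."""
    table = {}

    def procedure_a(nt):
        table[nt] = set()                        # step 2501
        for group in grammar[nt]:                # steps 2502, 2509, 2510
            elem = group[pick]                   # step 2503 (first or last element)
            if elem not in grammar:              # step 2504: string-corresponding
                table[nt].add(elem)              # step 2505
            elif elem in table:                  # step 2506: already set
                table[nt] |= table[elem]         # step 2508
            else:
                table[nt] |= procedure_a(elem)   # step 2507: recursive call
        return table[nt]                         # step 2511

    for nt in grammar:
        if nt not in table:
            procedure_a(nt)
    return table


def contiguous_pairs(grammar):
    """Pairs (preceding, succeeding) of string-corresponding elements."""
    first = edge_sets(grammar, 0)
    last = edge_sets(grammar, -1)
    pairs = set()
    for alternatives in grammar.values():
        for group in alternatives:
            for prev, nxt in zip(group, group[1:]):
                ends = last.get(prev, {prev})      # Last[] of preceding element
                starts = first.get(nxt, {nxt})     # First[] of succeeding element
                pairs |= {(p, n) for p in ends for n in starts}
    return pairs


FIRST = edge_sets(GRAMMAR, 0)
LAST = edge_sets(GRAMMAR, -1)
PAIRS = contiguous_pairs(GRAMMAR)
```

With this hypothetical fragment, `LAST["PROMULGATION"]` holds both "PROMULGATIONSTATEMENT" and "ESTABLISHEDREGULATIONNO.", so "TITLE" is found to be backward contiguous with both, in line with the FIG. 26 walk-through.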

Next, the process of the layout information extracting 
module 2205 shown in FIG. 21 for extracting the layout 
information 2206 from the layout definition 2204 will be 
described. 

The layout definition 2204 is set for subject documents
and defines with what layout each element is output. FIG. 28
shows part of an example of the layout definition prepared
for structured documents conforming with the document
type definition (DTD). Reference numeral 2901 indicates
that reference numerals 2901 to 2911 represent the layout
definitions of the element "TITLE". A [font name] 2902
indicates that the font name used for outputting "TITLE" is
Gothic, and a [font size] 2903 indicates that the font size is
12 pt (point), a length unit where 1 pt = 1/72 inch. A
[character pitch] indicates that the character pitch of
"TITLE" is 14 pt. An [offset 1] 2905 and an [offset 2] 2906
indicate what minimum spaces from the left and right sides
of a region where a document is output are reserved for
outputting the content of "TITLE". A [first-line
displacement] 2907 indicates the difference between the
[offset 1] and the offset of the first line, which often takes a
different offset from other lines. A [connection with previous
element] 2908 indicates which string is output after the
immediately preceding element. In this example, after the
immediately preceding element is output, a line feed is
output and the "TITLE" starts on a new line.
A [string information] 2909 describes which string is output.
In this example, a string CONTENT corresponding to the
"TITLE", i.e., the string between the tag <TITLE> and tag
</TITLE>, is output. A [placement] 2910 indicates how
strings are placed within the area defined by the [offset 1]
and [offset 2]. This [placement] 2910 takes one of four values,
"start", "end", "center", and "justify", corresponding to
left alignment, right alignment, centering, and equal spacing,
respectively. In this example, the string of "TITLE" is output
centered.

Such layout definitions are essentially used for outputting
a structured document and are not used for expressing the
layout of a non-structured document. However, for a document
having a regular layout, such as a legal document, the
layout definition is often determined in accordance with the
layout regularity. Most of the layout and string information
in the layout definition of such a document can be used for
extracting keywords from the non-structured document.

The layout information extracting module 2205 refers to
the layout definition 2204 and extracts, from the layout and
string information used for outputting each element, as many
of the items necessary for extracting a keyword as possible.
As described earlier, each such item is called a "required
item", and the information extracted for each item is called
a "required item content".

FIG. 29 shows an example of required items for each
keyword when the keyword rule shown in FIG. 5 is generated.
An [element name] 3001 is the name of a subject
string-corresponding element and takes a value of a string.
A [left-hand space] 3002 and a [right-hand space] 3003
indicate the conditions of what minimum character spaces
from the left and right sides of a region where a document
is output are reserved for outputting the string of the
element. A [first-line indent] 3004 indicates what character
spaces at the left side are reserved at the first line, which often
takes a different offset from other lines. A [string condition]
3005 indicates what string describes the keyword. An
[arrangement] 3006 indicates how keywords are arranged in
the region defined by the [left-hand space] 3002 and
[right-hand space] 3003. This [arrangement] 3006 takes one of
four values, "right justify", "left justify", "centering" and
"equal space". A [previous string] 3007 and a [next string] 3008
indicate the strings sandwiched between the subject keyword
and the string-corresponding elements appearing before
and after it.

The layout information extracting module 2205 refers to
the layout definition 2204 and extracts information of the
required items shown in FIG. 29, i.e., the required item
contents, as much as possible. FIG. 30 illustrates an example
of a process of extracting the required item contents from the
layout definition shown in FIG. 28.

In order to extract the required item content of a
string-corresponding element, the definition of the
string-corresponding element in the layout definition is used.
For example, the required item for the "ARTICLENO." is
extracted from the definitions 2912 to 2922 of the
"ARTICLENO." shown in FIG. 28.

The required items [left-hand space] and [right-hand
space] indicate the same contents as the [offset 1] and
[offset 2] of the layout definition. Therefore,
only the unit of length is changed from pt to the number of
characters. Specifically, the values of the [offset 1] and
[offset 2] are divided by the value of the [character pitch]
(3101 and 3102). The content of the required item [first-line
indent] is the sum of the [offset 1] and the [first-line
displacement] in the layout definition, divided by the
[character pitch] (3103). The content of the required item
[string condition] is generated by referring to the [string
information] in the layout definition (3104). However, in the
example shown in FIG. 28, the [string information] is
"CONTENT" for all elements, so that the string in the
document instance itself is output and specific information
of a string cannot be obtained from the layout definition.
Since the required item [arrangement] represents the same
concept as the [placement] in the layout definition, its values
are converted in accordance with the rules 3105. The content
of the [connection with previous element] is substituted into
the content of the required item [previous string] (3106).
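The conversions 3101 to 3106 amount to a little arithmetic, which might be sketched as below. This is a minimal sketch with several assumptions: the dictionary keys, the function name `extract_required_items`, and the sample offsets are hypothetical; the placement-to-arrangement table is inferred from the value lists given for [placement] and [arrangement] rather than copied from the rules 3105; and floor division is used because the text does not say how a non-integer character count is rounded.

```python
# Assumed mapping from [placement] values to [arrangement] values, inferred
# from the descriptions of both items (not taken from the rules 3105 of FIG. 30).
PLACEMENT_TO_ARRANGEMENT = {
    "start": "left justify",
    "end": "right justify",
    "center": "centering",
    "justify": "equal space",
}


def extract_required_items(layout):
    """Convert one element's layout definition (lengths in pt) into required items."""
    pitch = layout["character_pitch"]
    return {
        # 3101, 3102: pt -> number of characters (floor division is an assumption)
        "left_hand_space": layout["offset_1"] // pitch,
        "right_hand_space": layout["offset_2"] // pitch,
        # 3103: (offset 1 + first-line displacement) / character pitch
        "first_line_indent": (layout["offset_1"] + layout["first_line_displacement"]) // pitch,
        # 3104: "CONTENT" carries no specific string, so nothing can be extracted
        "string_condition": None if layout["string_information"] == "CONTENT"
        else layout["string_information"],
        # 3105: same concept, different vocabulary
        "arrangement": PLACEMENT_TO_ARRANGEMENT[layout["placement"]],
        # 3106: copied as is
        "previous_string": layout["connection_with_previous_element"],
    }


# Hypothetical layout values; only the 14 pt character pitch appears in the text.
SAMPLE = {
    "character_pitch": 14,
    "offset_1": 28,
    "offset_2": 28,
    "first_line_displacement": 14,
    "string_information": "CONTENT",
    "placement": "center",
    "connection_with_previous_element": "\n",
}
ITEMS = extract_required_items(SAMPLE)
```

For the sample values, 28 pt at a 14 pt pitch gives two character spaces on each side, and a first-line indent of (28 + 14) / 14 = 3 characters.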

The content of the required item [next string] is obtained
by using the string-corresponding element information and
the [connection with previous element] of other elements in
the layout definition (3107). Specifically, first a
string-corresponding element (hereinafter called a "next
element") backward contiguous with the subject
string-corresponding element is obtained by using the
string-corresponding element information. Next, the
[connection with previous element] is checked for all next
elements, and if the contents of all next elements are the
same, this content is set as the content of the [next string].
If there is a next element having a different content of the
[connection with previous element], the content of the
[next string] is not set. For example, from the
string-corresponding element information shown in FIG. 27
at 2806, it can be known that the next element of
"ARTICLENO." is only "ARTICLESTATEMENT". The content of
the [next string] of "ARTICLENO." is therefore " ", the content
of the [connection with previous element] of
"ARTICLESTATEMENT".

The above processes are executed for all string-corresponding
elements to generate the layout information
2206 shown in FIG. 21.

The keyword extraction rule generating module 2207
shown in FIG. 21 informs an operator, via the input/output
device 2211, of the string-corresponding element information
2203 and layout information 2206. This module 2207
receives supplementary information from the operator to add
and modify the required item content and generates the
keyword extraction rule 2212. A specific process of the
keyword extraction rule generating module 2207 will be
described.

The keyword information indicator module 2208 informs
the operator of the string-corresponding element name and
which string-corresponding element is set as the
keyword-corresponding element at a certain timing. If the
operator instructs to set a particular string-corresponding
element as the keyword-corresponding element, the keyword
information indicator module 2208 activates the supplementary
information editing module 2209 which supplements the
required item content of the string-corresponding element. If
the operator instructs to inspect whether the
keyword-corresponding elements set at that timing satisfy the
restriction condition that non-keywords should not be
contiguous, the contiguous element checking module 2210 is
activated.

FIG. 31 shows an example of an interface for the keyword
information indicator module 2208 to display information
on the input/display device 2211 for the operator, and FIG.
32 is its process flow. The operation of the keyword
information indicator module 2208 will be described with
reference to FIGS. 31 and 32. Upon activation, the keyword
information indicator module 2208 reads the
string-corresponding element information 2203 and obtains the
name of each string-corresponding element (3301). Reference
numeral 3201 represents a keyword information window
which is constituted by an element name display area
3202 for displaying the names of all string-corresponding
elements and a format condition display area 3203 for
displaying the format condition of the string-corresponding
element set as the keyword-corresponding element. At step
3302, the string-corresponding element name and the layout
condition of an element set as the keyword-corresponding
element at this timing are displayed. In this case, at the
initial stage, the format condition is not set to any element
so that the format condition display area 3203 displays no
information. In order to give the format condition to a
string-corresponding element and set this element as the
keyword-corresponding element, the operator first double
clicks the element name in the element name display area
3202 with a mouse to thereby activate the supplementary
information editing module (2209 in FIG. 21) (3304). The
detailed operation of the supplementary information editing
module 2209 will be given later. The string-corresponding
element name is passed to the supplementary information
editing module 2209, and its format condition is received as
the return argument. The string-corresponding element
designated by the operator is set as the keyword-corresponding
element (3305) and its format condition is displayed in the
format condition display area 3203 (3302). In the example
shown in FIG. 31, a display at the interface at a certain
timing is shown. At this timing, the format conditions are
given to the two string-corresponding elements "TITLE" 3206
and "PARAGRAPHNO." 3207, which means that the two
string-corresponding elements are set as the
keyword-corresponding elements.

Reference numeral 3204 represents a button for checking
contiguous elements. When this button 3204 is clicked, the
contiguous element checking module (2210 in FIG. 21) is
activated, which inspects whether the aggregation of
keyword-corresponding elements set at this timing satisfies
the restriction condition that non-keywords should not be
contiguous (3306). The operation of the contiguous element
checking module 2210 will be described later. If the
inspection judges that the keyword-corresponding elements
satisfying the restriction condition are set, the operator
clicks an exit button to instruct termination of the process
of the keyword information indicator module 2208. The keyword
information indicator module 2208 outputs the
keyword-corresponding element name and its format condition
as the keyword extraction rule (2212 in FIG. 21) and
terminates the process (3307). The contents of the processes
by the keyword information indicator module 2208 have been
described above.

FIG. 33 shows an example of an interface of the
supplementary information editing module 2209 activated when
the element name is double clicked during the operation of
the keyword information indicator module 2208, and FIG.
34 shows the process flow. The supplementary information
editing module 2209 reads the name of the
string-corresponding element set as the keyword-corresponding
element whose layout condition is to be set, the name being
passed from the keyword information indicator module 2208
(3501), and reads the required item content of the element
from the layout information (2206 in FIG. 21) (3502). The
required item content is displayed on a required item editor
3401 (3503). The required item editor 3401 consists of
windows in which the display content can be edited. If the
display content is different from the description format of the
non-structured document, the operator changes its content.
Since the required item content (e.g., [string condition] in
the extraction example shown in FIGS. 30 and 31) which
cannot be extracted by the layout information extracting
module 2205 is not displayed on the required item editor, the
operator enters the required item content to the required item
editor (3504 to 3503). An example after the [string
condition] is entered is shown in FIG. 30 under the title of
"after entering string condition".

After the required item contents are edited and all the
required item contents match the description format of the
non-structured document, the operator clicks an exit button
3402 to instruct the termination of the processes of the
supplementary information editing module 2209. The
supplementary information editing module 2209 generates
the format conditions from the edited required item contents
of the string-corresponding elements set as the
keyword-corresponding elements (3506), and passes the format
conditions as the return argument to the keyword information
indicator module 2208 (3507). The process flow of generating
the format condition from the required item content is
shown in FIG. 35. The steps surrounded by a broken line in
FIG. 35 add an example to this process flow, in which the
required item content of "ARTICLENO." shown under the title
of "after entering string condition" is converted into the
format condition.

First, the content (e.g., "ARTICLE"NUM1) of the
required item [string condition] is substituted into the format
condition, and it is checked whether the content of the
required item [previous string] is a line feed (3601). If it is
a line feed, the flow advances to step 3603, whereas if not, the
format condition is surrounded by "[" and "]", and "+" and
the content of the [previous string] are added just before it
(3602). In this case, a blank is converted into SPC[integer].
Next, at step 3603 it is checked whether the content of the
required item [next string] is a line feed. If it is a line feed,
a line-end symbol is added to the end of the format condition
(3605) and the flow advances to step 3606, whereas if not, the
format condition is surrounded by "[" and "]" if the format
condition does not contain "[" and "]", and the content of the
[next string] and "+" are added just after it (3604, e.g.,
["ARTICLE"NUM1 SPC1+]). At step 3606 it is checked whether
the content of the required item [arrangement] is "centering"
or not. If "centering", "C" is added to the start of the format
condition (3607) and the generation of the format condition is
terminated. If not "centering", the flow advances to step 3608
and the process A or B is executed depending upon the content
of the [arrangement]. If the content of the [arrangement] is
"left justify", the process A is performed, if "right justify",
the process B is performed, and if "equal space", both the
processes A and B are performed, to thereafter terminate the
generation of the format condition. In the process A, "SPCx"
is added to the start of the format condition (3609) where x
is the content of the [first-line indent] (e.g., SPC0
["ARTICLE"NUM1] SPC1+). In the process B, first
"SPCyS" is added to the end of the format condition (3610)
where y is the content of the [right-hand space]. Next, if
or "+" at the start of the format condition, "!" is added to the
start of the format condition (3611).
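The branches of FIG. 35 whose output the text spells out can be sketched as follows. This is a partial sketch, not the full flow: it covers only a [previous string] that is a line feed, a [next string] that is not a line feed, and the "centering" and "left justify" arrangements, and it raises an error for the remaining branches. The function names and the single-space separators between tokens are assumptions of this example.

```python
LINE_FEED = "\n"


def encode_blanks(text):
    """Encode a run of blanks as SPC<count>; any other string passes through."""
    if text and set(text) == {" "}:
        return f"SPC{len(text)}"
    return text


def build_format_condition(string_condition, previous_string, next_string,
                           arrangement, first_line_indent):
    """Partial sketch of steps 3601 to 3609 (token separators are assumed)."""
    cond = string_condition                      # start from the [string condition]
    if previous_string != LINE_FEED:             # step 3601 -> 3602 not covered here
        raise NotImplementedError("non-line-feed [previous string] is not sketched")
    if next_string == LINE_FEED:                 # step 3603 -> 3605 not covered here
        raise NotImplementedError("line-feed [next string] is not sketched")
    if not ("[" in cond and "]" in cond):        # step 3604: bracket if needed
        cond = f"[{cond}]"
    cond = f"{cond} {encode_blanks(next_string)}+"   # append [next string] and "+"
    if arrangement == "centering":               # steps 3606, 3607
        return f"C {cond}"
    if arrangement == "left justify":            # step 3608 -> process A (3609)
        return f"SPC{first_line_indent} {cond}"
    raise NotImplementedError("process B is not sketched")


ARTICLE_NO = build_format_condition('"ARTICLE"NUM1', LINE_FEED, " ", "left justify", 0)
```

With the "ARTICLENO." inputs used in the text (previous string a line feed, next string one blank, left justify, first-line indent 0), the sketch yields the same token sequence as the example given for process A.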

The supplementary information editing module 2209
passes the obtained format condition as the return argument
to the keyword information indicator module (3507 in FIG.
34) which in turn executes the process. The above description
covers the processes of the supplementary information
editing module 2209.

FIG. 36 shows the process flow of the contiguous element
checking module 2210 activated when the contiguity check
button is clicked during the operation of the keyword
information indicator module (2208 in FIG. 21), and FIG.
37 shows an example of its processes. The contiguous
element checking module 2210 first reads the
keyword-corresponding elements given by the keyword
information indicator module 2208 (3701, e.g., 3801). Next, it
reads the string-corresponding element information (2203 in
FIG. 21) (3702). Then, non-keyword-corresponding elements are
obtained as the aggregation of all string-corresponding
elements minus the keyword-corresponding elements
(3703, e.g., 3802). At step 3704, by referring to the
string-corresponding element information, it is checked
whether any non-keyword-corresponding element has another
non-keyword-corresponding element as its next element
(e.g., 3803). If there is such a non-keyword-corresponding
element, the operator is informed of the contiguous
non-keyword-corresponding elements (3705, e.g., 3804) and the
process is terminated. If there is not, the operator is
informed to that effect (3706) and the process is terminated.
The above description covers the processes of the
contiguous element checking module 2210.
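The check of FIG. 36 reduces to a set difference followed by a scan of the contiguity pairs, which might look like the sketch below. The function name, the representation of the string-corresponding element information as (element, next element) pairs, and the sample pairs are all assumptions made for illustration; they are not the content of FIG. 27 or FIG. 37.

```python
def find_contiguous_non_keywords(all_elements, keyword_elements, next_pairs):
    """Return the (element, next element) pairs in which both are non-keywords."""
    non_keywords = set(all_elements) - set(keyword_elements)      # step 3703
    return sorted(
        (prev, nxt)
        for prev, nxt in next_pairs                               # step 3704
        if prev in non_keywords and nxt in non_keywords
    )


# Hypothetical string-corresponding element information.
ELEMENTS = ["ARTICLENO.", "ARTICLESTATEMENT", "PARAGRAPHNO.", "PARAGRAPHSTATEMENT"]
NEXT_PAIRS = {
    ("ARTICLENO.", "ARTICLESTATEMENT"),
    ("ARTICLESTATEMENT", "PARAGRAPHNO."),
    ("PARAGRAPHNO.", "PARAGRAPHSTATEMENT"),
    ("PARAGRAPHSTATEMENT", "PARAGRAPHNO."),
}

# Both number elements are keywords: no two non-keywords can be contiguous.
OK = find_contiguous_non_keywords(ELEMENTS, {"ARTICLENO.", "PARAGRAPHNO."}, NEXT_PAIRS)
# Only "ARTICLENO." is a keyword: the restriction condition is violated.
BAD = find_contiguous_non_keywords(ELEMENTS, {"ARTICLENO."}, NEXT_PAIRS)
```

An empty result corresponds to step 3706 (the operator is told the condition holds); a non-empty result lists the pairs that would be reported at step 3705.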

With this embodiment, the keyword extraction rule can be 
generated. The programs described with this embodiment 
may be stored in a storage such as a hard disk, a floppy disk, 
an optical disk, and a CD-ROM. 

What is claimed is: 

1. A method of generating a structured document for a
structured document generating apparatus having at least an
input/output device, a control unit, and a repository wherein
a non-structured document not explicitly given the document
structure and input from said input/output device is
converted into a structured document explicitly given the
document structure, in accordance with a document structure
definition defining the document structure, said method
comprising the steps of:

modifying a given first document structure definition so as
to match the document structure of said input non-structured
document and generate a second document structure definition;

by said control unit, generating a parsing rule used for
performing a parsing process suitable for the document
structure of said second document structure definition,
by modifying marks constituting said second document
structure definition and modifying said second document
structure definition so as to make the positional
order of said marks in one-to-one correspondence;

in accordance with said generated parsing rule, generating
a first structured document from said non-structured
document; and

in accordance with difference data between said first
document structure definition and said second document
structure definition, converting said generated
first structured document into a format matching said
first document structure definition to thereby generate a
second structured document.

2. A method of generating a structured document according
to claim 1, wherein said first and second document
structure definitions include mark trains disposed for defining
the relationship between character strings constituting a
document to be input.

3. A method of generating a structured document according
to claim 2, wherein said parsing rule is generated by
embedding a process of explicitly giving the parsed portion
of document structure to be parsed, into an interim rule
generated by converting said second document structure
definition in accordance with a given rule conversion
regulation.

4. A method of generating a structured document according
to claim 2, wherein the mark strings of said first and
second document structure definitions describe the document
structure, representing a conceptional relationship
between the character strings of a document to be input, by
disposing names representing the concept of each character
string.

5. A method of generating a structured document according
to claim 2, further comprising the steps of:

extracting a keyword from said non-structured document
in accordance with a predetermined rule regarding the
character strings of a document to be input, and generating
a keyword/text model including at least character
strings extracted as keywords and other character
strings; and

converting said keyword/text model into said first structured
document by using said parsing rule.

6. A method of generating a structured document according
to claim 5, wherein if the same character string in the
same character region is extracted as a plurality of
keywords, said control unit selects a proper one from the
plurality of keywords in accordance with whether the parsing
process succeeds or fails.

7. A method of generating a structured document according
to claim 5, wherein said keyword is extracted by
analyzing each character string in said non-structured
document with reference to a keyword extraction rule having a
correspondence between a format condition of each character
string and a keyword name.

8. A method of generating a structured document according
to claim 7, wherein said keyword extraction rule is
generated, if a layout definition of said non-structured
document is given, by modifying said layout definition in
accordance with a predetermined rule.

9. A storage device storing a program realizing a process
executable by a computer, the process comprising the steps
of:

modifying a given first document structure definition so as
to match the document structure of an input non-structured
document and generate a second document structure definition;

a control unit generating a parsing rule used for performing
a parsing process suitable for the document structure
of said second document structure definition, by
modifying marks constituting said second document
structure definition and modifying said second document
structure definition so as to make the positional
order of said marks in one-to-one correspondence;

in accordance with said generated parsing rule, generating
a first structured document from said input non-structured
document; and

in accordance with difference data between said first
document structure definition and said second document
structure definition, converting said generated
first structured document into a format matching said
first document structure definition to thereby generate a
second structured document.

* * * * *